Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. https://machinelearningmastery.com/

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Online News Popularity dataset presents a regression problem where we are trying to predict the value of a continuous variable.

INTRODUCTION: This dataset summarizes a heterogeneous set of features about articles published by Mashable in a period of two years. The goal is to predict the article’s popularity level in social networks. The dataset does not contain the original content, but some statistics associated with it. The original content can be publicly accessed and retrieved using the provided URLs.

Many thanks to K. Fernandes, P. Vinagre, and P. Cortez for making the dataset and benchmarking information available. The work is described in: A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.

In iteration Take1, the script focused on evaluating various machine learning algorithms and identifying the algorithm that produces the best RMSE result. Iteration Take1 established a baseline performance regarding RMSE and processing time.

In iteration Take2, we examined the feasibility of a dimensionality reduction technique that ranks attribute importance with a gradient boosting tree method. Afterward, we eliminated the features that did not contribute to the top 0.99 (or 99%) of cumulative importance.
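The Take2 technique can be sketched with caret as follows. This is a minimal, hypothetical illustration (the actual Take2 script is not reproduced here); it assumes a gbm model and the xy_train dataframe that this script creates in section 1.b.

```r
# Hypothetical sketch of the Take2 technique: rank attributes by importance
# with a gradient boosting tree model, then keep only the attributes that
# fall within 99% of the cumulative importance. Assumes xy_train exists
# with the target column named targetVar.
library(caret)
gbm_model <- train(targetVar ~ ., data = xy_train, method = "gbm", verbose = FALSE)
imp <- varImp(gbm_model)$importance
imp <- imp[order(-imp$Overall), , drop = FALSE]
cum_share <- cumsum(imp$Overall) / sum(imp$Overall)
keep_attrs <- rownames(imp)[cum_share <= 0.99]
```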

For this iteration, we will explore the Recursive Feature Elimination (or RFE) technique, which recursively removes attributes and builds a model on the attributes that remain. To keep the training time manageable, we will limit the number of attributes to 50.
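A minimal sketch of RFE with caret's rfe() function is shown below, assuming the x_train/y_train objects created later in this script; lmFuncs and the sizes vector are illustrative choices, not necessarily the settings used in this iteration.

```r
# Hypothetical RFE sketch: cross-validated recursive feature elimination
# that evaluates candidate subset sizes and reports the best-performing set.
library(caret)
rfe_ctrl <- rfeControl(functions = lmFuncs, method = "cv", number = 10)
set.seed(888)
rfe_result <- rfe(x_train, y_train, sizes = c(10, 20, 30, 40, 50), rfeControl = rfe_ctrl)
predictors(rfe_result)  # the attribute subset retained by RFE
```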

ANALYSIS: From the previous iteration Take1, the baseline performance of the machine learning algorithms achieved an average RMSE of 10446. Two algorithms (Random Forest and Stochastic Gradient Boosting) achieved the top RMSE scores after the first round of modeling. After a series of tuning trials, Random Forest turned in the top result using the training data. It achieved the best RMSE of 10299. Using the optimized tuning parameters available, the Random Forest algorithm processed the validation dataset with an RMSE of 12978, which was noticeably worse than the RMSE from the training data and possibly due to over-fitting.

From the previous iteration Take2, the baseline performance of the machine learning algorithms achieved an average RMSE of 10409. Two algorithms (ElasticNet and Stochastic Gradient Boosting) achieved the top RMSE scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result using the training data. It achieved the best RMSE of 10312. Using the optimized tuning parameters available, the Stochastic Gradient Boosting algorithm processed the validation dataset with an RMSE of 13007, which was worse than the RMSE from the training data and possibly due to over-fitting.

In the current iteration, the baseline performance of the machine learning algorithms achieved an average RMSE of 10503. Three algorithms (Ridge, LASSO, and ElasticNet) achieved the top RMSE scores after the first round of modeling. After a series of tuning trials, ElasticNet turned in the top result using the training data. It achieved the best RMSE of 10320. Using the optimized tuning parameters available, the ElasticNet algorithm processed the validation dataset with an RMSE of 13049, which was worse than the RMSE from the training data and possibly due to over-fitting.

From the model-building activities, the number of attributes went from 58 down to 48 after eliminating 10 attributes. The processing time went from 21 hours 7 minutes in iteration Take1 down to 14 hours 49 minutes in iteration Take3, a reduction of 29% from Take1. The processing time, however, was a slight increase from Take2, which processed the dataset in 11 hours 41 minutes.

CONCLUSION: The two feature selection techniques yielded different attribute selection sets and outcomes. For this dataset, the Stochastic Gradient Boosting algorithm and the attribute importance ranking technique from iteration Take2 should be considered for further modeling or production use.

Dataset Used: Online News Popularity Dataset

Dataset ML Model: Regression with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity

The project aims to touch on the following areas:

  1. Document a predictive modeling problem end-to-end.
  2. Explore data cleaning and transformation options.
  3. Explore non-ensemble and ensemble algorithms for baseline model performance.
  4. Explore algorithm tuning techniques for improving model performance.

Any predictive modeling machine learning project generally can be broken down into six major tasks:

  1. Prepare Problem
  2. Summarize Data
  3. Prepare Data
  4. Model and Evaluate Algorithms
  5. Improve Accuracy or Results
  6. Finalize Model and Present Results

1. Prepare Problem

1.a) Load libraries

startTimeScript <- proc.time()
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(corrplot)
## corrplot 0.84 loaded
library(parallel)
library(mailR)

# Create one random seed number for reproducible results
seedNum <- 888
set.seed(seedNum)

1.b) Load dataset

originalDataset <- read.csv("OnlineNewsPopularity.csv", header= TRUE)

# Dropping the two non-predictive attributes: url and timedelta
originalDataset$url <- NULL
originalDataset$timedelta <- NULL

# Different ways of reading and processing the input dataset. Saving these for future reference.
#x_train <- read.fwf("X_train.txt", widths = widthVector, col.names = colNames)
#y_train <- read.csv("y_train.txt", header = FALSE, col.names = c("targetVar"))
#y_train$targetVar <- as.factor(y_train$targetVar)
#xy_train <- cbind(x_train, y_train)
# Use variable totCol to hold the number of columns in the dataframe
totCol <- ncol(originalDataset)

# Set up variable totAttr for the total number of attribute columns
totAttr <- totCol-1
# targetCol variable indicates the column location of the target/class variable
# If the first column, set targetCol to 1. If the last column, set targetCol to totCol
# If (targetCol != 1) and (targetCol != totCol), be aware when slicing up the dataframes for visualization!
targetCol <- totCol
colnames(originalDataset)[targetCol] <- "targetVar"
# We create training datasets (xy_train, x_train, y_train) for various operations.
# We create validation datasets (xy_test, x_test, y_test) for various operations.
set.seed(seedNum)

# Create a list of the rows in the original dataset we can use for training
training_index <- createDataPartition(originalDataset$targetVar, p=0.70, list=FALSE)
# Use 70% of the data to train the models and the remaining for testing/validation
xy_train <- originalDataset[training_index,]
xy_test <- originalDataset[-training_index,]

if (targetCol==1) {
  x_train <- xy_train[,(targetCol+1):totCol]
  y_train <- xy_train[,targetCol]
  x_test <- xy_test[,(targetCol+1):totCol]
  y_test <- xy_test[,targetCol]
} else {
  x_train <- xy_train[,1:(totAttr)]
  y_train <- xy_train[,totCol]
  x_test <- xy_test[,1:(totAttr)]
  y_test <- xy_test[,totCol]
}

1.c) Set up the key parameters to be used in the script

# Set up the number of rows and columns for visualization display. dispRow * dispCol should be >= totAttr
dispCol <- 4
if (totAttr%%dispCol == 0) {
  dispRow <- totAttr%/%dispCol
} else {
  dispRow <- (totAttr%/%dispCol) + 1
}
cat("Will attempt to create graphics grid (col x row): ", dispCol, ' by ', dispRow)
## Will attempt to create graphics grid (col x row):  4  by  15

1.d) Set test options and evaluation metric

# Run algorithms using 10-fold cross validation
control <- trainControl(method="repeatedcv", number=10, repeats=1)
metricTarget <- "RMSE"
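RMSE penalizes large errors quadratically, which is why a handful of badly mispredicted articles can dominate the score. As a quick reference, the metric can be computed by hand:

```r
# RMSE (root mean squared error), the metric caret reports for regression:
# the square root of the mean of the squared residuals.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(c(3, 5, 7), c(2, 5, 9))  # sqrt((1^2 + 0^2 + 2^2) / 3), about 1.29
```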

1.e) Set up the email notification function

email_notify <- function(msg=""){
  sender <- "luozhi2488@gmail.com"
  receiver <- "dave@contactdavidlowe.com"
  sbj_line <- "Notification from R Script"
  password <- readLines("email_credential.txt")
  send.mail(
    from = sender,
    to = receiver,
    subject= sbj_line,
    body = msg,
    smtp = list(host.name = "smtp.gmail.com", port = 465, user.name = sender, passwd = password, ssl = TRUE),
    authenticate = TRUE,
    send = TRUE)
}
email_notify(paste("Library and Data Loading Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@47fd17e3}"

2. Summarize Data

To gain a better understanding of the data that we have on hand, we will leverage a number of descriptive statistics and data visualization techniques. The plan is to use the results to consider new questions, review assumptions, and validate hypotheses that we can investigate later with specialized models.

2.a) Descriptive statistics

2.a.i) Peek at the data itself.

head(xy_train)
##   n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words
## 2              9              255       0.6047431                1
## 3              9              211       0.5751295                1
## 5             13             1072       0.4156456                1
## 6             10              370       0.5598886                1
## 7              8              960       0.4181626                1
## 8             12              989       0.4335736                1
##   n_non_stop_unique_tokens num_hrefs num_self_hrefs num_imgs num_videos
## 2                0.7919463         3              1        1          0
## 3                0.6638655         3              1        1          0
## 5                0.5408895        19             19       20          0
## 6                0.6981982         2              2        0          0
## 7                0.5498339        21             20       20          0
## 8                0.5721078        20             20       20          0
##   average_token_length num_keywords data_channel_is_lifestyle
## 2             4.913725            4                         0
## 3             4.393365            6                         0
## 5             4.682836            7                         0
## 6             4.359459            9                         0
## 7             4.654167           10                         1
## 8             4.617796            9                         0
##   data_channel_is_entertainment data_channel_is_bus data_channel_is_socmed
## 2                             0                   1                      0
## 3                             0                   1                      0
## 5                             0                   0                      0
## 6                             0                   0                      0
## 7                             0                   0                      0
## 8                             0                   0                      0
##   data_channel_is_tech data_channel_is_world kw_min_min kw_max_min
## 2                    0                     0          0          0
## 3                    0                     0          0          0
## 5                    1                     0          0          0
## 6                    1                     0          0          0
## 7                    0                     0          0          0
## 8                    1                     0          0          0
##   kw_avg_min kw_min_max kw_max_max kw_avg_max kw_min_avg kw_max_avg
## 2          0          0          0          0          0          0
## 3          0          0          0          0          0          0
## 5          0          0          0          0          0          0
## 6          0          0          0          0          0          0
## 7          0          0          0          0          0          0
## 8          0          0          0          0          0          0
##   kw_avg_avg self_reference_min_shares self_reference_max_shares
## 2          0                         0                         0
## 3          0                       918                       918
## 5          0                       545                     16000
## 6          0                      8500                      8500
## 7          0                       545                     16000
## 8          0                       545                     16000
##   self_reference_avg_sharess weekday_is_monday weekday_is_tuesday
## 2                      0.000                 1                  0
## 3                    918.000                 1                  0
## 5                   3151.158                 1                  0
## 6                   8500.000                 1                  0
## 7                   3151.158                 1                  0
## 8                   3151.158                 1                  0
##   weekday_is_wednesday weekday_is_thursday weekday_is_friday
## 2                    0                   0                 0
## 3                    0                   0                 0
## 5                    0                   0                 0
## 6                    0                   0                 0
## 7                    0                   0                 0
## 8                    0                   0                 0
##   weekday_is_saturday weekday_is_sunday is_weekend     LDA_00     LDA_01
## 2                   0                 0          0 0.79975569 0.05004668
## 3                   0                 0          0 0.21779229 0.03333446
## 5                   0                 0          0 0.02863281 0.02879355
## 6                   0                 0          0 0.02224528 0.30671758
## 7                   0                 0          0 0.02008167 0.11470539
## 8                   0                 0          0 0.02222436 0.15073297
##       LDA_02     LDA_03     LDA_04 global_subjectivity
## 2 0.05009625 0.05010067 0.05000071           0.3412458
## 3 0.03335142 0.03333354 0.68218829           0.7022222
## 5 0.02857518 0.02857168 0.88542678           0.5135021
## 6 0.02223128 0.02222429 0.62658158           0.4374086
## 7 0.02002437 0.02001533 0.82517325           0.5144803
## 8 0.24343548 0.02222360 0.56138359           0.5434742
##   global_sentiment_polarity global_rate_positive_words
## 2                0.14894781                 0.04313725
## 3                0.32333333                 0.05687204
## 5                0.28100348                 0.07462687
## 6                0.07118419                 0.02972973
## 7                0.26830272                 0.08020833
## 8                0.29861347                 0.08392315
##   global_rate_negative_words rate_positive_words rate_negative_words
## 2                0.015686275           0.7333333           0.2666667
## 3                0.009478673           0.8571429           0.1428571
## 5                0.012126866           0.8602151           0.1397849
## 6                0.027027027           0.5238095           0.4761905
## 7                0.016666667           0.8279570           0.1720430
## 8                0.015166835           0.8469388           0.1530612
##   avg_positive_polarity min_positive_polarity max_positive_polarity
## 2             0.2869146            0.03333333                   0.7
## 3             0.4958333            0.10000000                   1.0
## 5             0.4111274            0.03333333                   1.0
## 6             0.3506100            0.13636364                   0.6
## 7             0.4020386            0.10000000                   1.0
## 8             0.4277205            0.10000000                   1.0
##   avg_negative_polarity min_negative_polarity max_negative_polarity
## 2            -0.1187500                -0.125            -0.1000000
## 3            -0.4666667                -0.800            -0.1333333
## 5            -0.2201923                -0.500            -0.0500000
## 6            -0.1950000                -0.400            -0.1000000
## 7            -0.2244792                -0.500            -0.0500000
## 8            -0.2427778                -0.500            -0.0500000
##   title_subjectivity title_sentiment_polarity abs_title_subjectivity
## 2          0.0000000                0.0000000             0.50000000
## 3          0.0000000                0.0000000             0.50000000
## 5          0.4545455                0.1363636             0.04545455
## 6          0.6428571                0.2142857             0.14285714
## 7          0.0000000                0.0000000             0.50000000
## 8          1.0000000                0.5000000             0.50000000
##   abs_title_sentiment_polarity targetVar
## 2                    0.0000000       711
## 3                    0.0000000      1500
## 5                    0.1363636       505
## 6                    0.2142857       855
## 7                    0.0000000       556
## 8                    0.5000000       891

2.a.ii) Dimensions of the dataset.

dim(xy_train)
## [1] 27752    59
dim(xy_test)
## [1] 11892    59

2.a.iii) Types of the attributes.

sapply(xy_train, class)
##                n_tokens_title              n_tokens_content 
##                     "numeric"                     "numeric" 
##               n_unique_tokens              n_non_stop_words 
##                     "numeric"                     "numeric" 
##      n_non_stop_unique_tokens                     num_hrefs 
##                     "numeric"                     "numeric" 
##                num_self_hrefs                      num_imgs 
##                     "numeric"                     "numeric" 
##                    num_videos          average_token_length 
##                     "numeric"                     "numeric" 
##                  num_keywords     data_channel_is_lifestyle 
##                     "numeric"                     "numeric" 
## data_channel_is_entertainment           data_channel_is_bus 
##                     "numeric"                     "numeric" 
##        data_channel_is_socmed          data_channel_is_tech 
##                     "numeric"                     "numeric" 
##         data_channel_is_world                    kw_min_min 
##                     "numeric"                     "numeric" 
##                    kw_max_min                    kw_avg_min 
##                     "numeric"                     "numeric" 
##                    kw_min_max                    kw_max_max 
##                     "numeric"                     "numeric" 
##                    kw_avg_max                    kw_min_avg 
##                     "numeric"                     "numeric" 
##                    kw_max_avg                    kw_avg_avg 
##                     "numeric"                     "numeric" 
##     self_reference_min_shares     self_reference_max_shares 
##                     "numeric"                     "numeric" 
##    self_reference_avg_sharess             weekday_is_monday 
##                     "numeric"                     "numeric" 
##            weekday_is_tuesday          weekday_is_wednesday 
##                     "numeric"                     "numeric" 
##           weekday_is_thursday             weekday_is_friday 
##                     "numeric"                     "numeric" 
##           weekday_is_saturday             weekday_is_sunday 
##                     "numeric"                     "numeric" 
##                    is_weekend                        LDA_00 
##                     "numeric"                     "numeric" 
##                        LDA_01                        LDA_02 
##                     "numeric"                     "numeric" 
##                        LDA_03                        LDA_04 
##                     "numeric"                     "numeric" 
##           global_subjectivity     global_sentiment_polarity 
##                     "numeric"                     "numeric" 
##    global_rate_positive_words    global_rate_negative_words 
##                     "numeric"                     "numeric" 
##           rate_positive_words           rate_negative_words 
##                     "numeric"                     "numeric" 
##         avg_positive_polarity         min_positive_polarity 
##                     "numeric"                     "numeric" 
##         max_positive_polarity         avg_negative_polarity 
##                     "numeric"                     "numeric" 
##         min_negative_polarity         max_negative_polarity 
##                     "numeric"                     "numeric" 
##            title_subjectivity      title_sentiment_polarity 
##                     "numeric"                     "numeric" 
##        abs_title_subjectivity  abs_title_sentiment_polarity 
##                     "numeric"                     "numeric" 
##                     targetVar 
##                     "integer"

2.a.iv) Statistical summary of all attributes.

summary(xy_train)
##  n_tokens_title n_tokens_content n_unique_tokens    n_non_stop_words  
##  Min.   : 3.0   Min.   :   0.0   Min.   :  0.0000   Min.   :   0.000  
##  1st Qu.: 9.0   1st Qu.: 246.0   1st Qu.:  0.4707   1st Qu.:   1.000  
##  Median :10.0   Median : 409.0   Median :  0.5393   Median :   1.000  
##  Mean   :10.4   Mean   : 547.2   Mean   :  0.5555   Mean   :   1.008  
##  3rd Qu.:12.0   3rd Qu.: 716.0   3rd Qu.:  0.6081   3rd Qu.:   1.000  
##  Max.   :23.0   Max.   :8474.0   Max.   :701.0000   Max.   :1042.000  
##  n_non_stop_unique_tokens   num_hrefs      num_self_hrefs  
##  Min.   :  0.0000         Min.   :  0.00   Min.   : 0.000  
##  1st Qu.:  0.6255         1st Qu.:  4.00   1st Qu.: 1.000  
##  Median :  0.6903         Median :  7.00   Median : 3.000  
##  Mean   :  0.6957         Mean   : 10.88   Mean   : 3.302  
##  3rd Qu.:  0.7542         3rd Qu.: 14.00   3rd Qu.: 4.000  
##  Max.   :650.0000         Max.   :304.00   Max.   :74.000  
##     num_imgs         num_videos     average_token_length  num_keywords   
##  Min.   :  0.000   Min.   : 0.000   Min.   :0.000        Min.   : 1.000  
##  1st Qu.:  1.000   1st Qu.: 0.000   1st Qu.:4.477        1st Qu.: 6.000  
##  Median :  1.000   Median : 0.000   Median :4.662        Median : 7.000  
##  Mean   :  4.563   Mean   : 1.262   Mean   :4.546        Mean   : 7.227  
##  3rd Qu.:  4.000   3rd Qu.: 1.000   3rd Qu.:4.854        3rd Qu.: 9.000  
##  Max.   :111.000   Max.   :91.000   Max.   :6.610        Max.   :10.000  
##  data_channel_is_lifestyle data_channel_is_entertainment
##  Min.   :0.00000           Min.   :0.000                
##  1st Qu.:0.00000           1st Qu.:0.000                
##  Median :0.00000           Median :0.000                
##  Mean   :0.05387           Mean   :0.178                
##  3rd Qu.:0.00000           3rd Qu.:0.000                
##  Max.   :1.00000           Max.   :1.000                
##  data_channel_is_bus data_channel_is_socmed data_channel_is_tech
##  Min.   :0.0000      Min.   :0.00000        Min.   :0.0000      
##  1st Qu.:0.0000      1st Qu.:0.00000        1st Qu.:0.0000      
##  Median :0.0000      Median :0.00000        Median :0.0000      
##  Mean   :0.1579      Mean   :0.05801        Mean   :0.1864      
##  3rd Qu.:0.0000      3rd Qu.:0.00000        3rd Qu.:0.0000      
##  Max.   :1.0000      Max.   :1.00000        Max.   :1.0000      
##  data_channel_is_world   kw_min_min       kw_max_min       kw_avg_min     
##  Min.   :0.0000        Min.   : -1.00   Min.   :     0   Min.   :   -1.0  
##  1st Qu.:0.0000        1st Qu.: -1.00   1st Qu.:   450   1st Qu.:  141.9  
##  Median :0.0000        Median : -1.00   Median :   662   Median :  235.1  
##  Mean   :0.2092        Mean   : 26.13   Mean   :  1159   Mean   :  313.8  
##  3rd Qu.:0.0000        3rd Qu.:  4.00   3rd Qu.:  1000   3rd Qu.:  356.8  
##  Max.   :1.0000        Max.   :377.00   Max.   :298400   Max.   :42827.9  
##    kw_min_max       kw_max_max       kw_avg_max       kw_min_avg  
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :  -1  
##  1st Qu.:     0   1st Qu.:843300   1st Qu.:172048   1st Qu.:   0  
##  Median :  1400   Median :843300   Median :245025   Median :1034  
##  Mean   : 13458   Mean   :752066   Mean   :259524   Mean   :1122  
##  3rd Qu.:  7900   3rd Qu.:843300   3rd Qu.:331986   3rd Qu.:2066  
##  Max.   :843300   Max.   :843300   Max.   :843300   Max.   :3613  
##    kw_max_avg       kw_avg_avg    self_reference_min_shares
##  Min.   :     0   Min.   :    0   Min.   :     0           
##  1st Qu.:  3564   1st Qu.: 2386   1st Qu.:   638           
##  Median :  4358   Median : 2870   Median :  1200           
##  Mean   :  5640   Mean   : 3137   Mean   :  4084           
##  3rd Qu.:  6021   3rd Qu.: 3605   3rd Qu.:  2600           
##  Max.   :298400   Max.   :43568   Max.   :843300           
##  self_reference_max_shares self_reference_avg_sharess weekday_is_monday
##  Min.   :     0            Min.   :     0             Min.   :0.0000   
##  1st Qu.:  1100            1st Qu.:   985             1st Qu.:0.0000   
##  Median :  2800            Median :  2200             Median :0.0000   
##  Mean   : 10164            Mean   :  6380             Mean   :0.1689   
##  3rd Qu.:  7900            3rd Qu.:  5100             3rd Qu.:0.0000   
##  Max.   :843300            Max.   :843300             Max.   :1.0000   
##  weekday_is_tuesday weekday_is_wednesday weekday_is_thursday
##  Min.   :0.0000     Min.   :0.0000       Min.   :0.0000     
##  1st Qu.:0.0000     1st Qu.:0.0000       1st Qu.:0.0000     
##  Median :0.0000     Median :0.0000       Median :0.0000     
##  Mean   :0.1865     Mean   :0.1886       Mean   :0.1833     
##  3rd Qu.:0.0000     3rd Qu.:0.0000       3rd Qu.:0.0000     
##  Max.   :1.0000     Max.   :1.0000       Max.   :1.0000     
##  weekday_is_friday weekday_is_saturday weekday_is_sunday   is_weekend    
##  Min.   :0.0000    Min.   :0.00000     Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000    1st Qu.:0.00000     1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000    Median :0.00000     Median :0.00000   Median :0.0000  
##  Mean   :0.1434    Mean   :0.06191     Mean   :0.06735   Mean   :0.1293  
##  3rd Qu.:0.0000    3rd Qu.:0.00000     3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000    Max.   :1.00000     Max.   :1.00000   Max.   :1.0000  
##      LDA_00            LDA_01            LDA_02            LDA_03       
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.02505   1st Qu.:0.02501   1st Qu.:0.02857   1st Qu.:0.02857  
##  Median :0.03339   Median :0.03334   Median :0.04000   Median :0.04000  
##  Mean   :0.18415   Mean   :0.14087   Mean   :0.21465   Mean   :0.22515  
##  3rd Qu.:0.24039   3rd Qu.:0.15034   3rd Qu.:0.32802   3rd Qu.:0.38152  
##  Max.   :0.92699   Max.   :0.92595   Max.   :0.92000   Max.   :0.91998  
##      LDA_04        global_subjectivity global_sentiment_polarity
##  Min.   :0.00000   Min.   :0.0000      Min.   :-0.38021         
##  1st Qu.:0.02857   1st Qu.:0.3955      1st Qu.: 0.05712         
##  Median :0.04073   Median :0.4534      Median : 0.11867         
##  Mean   :0.23514   Mean   :0.4430      Mean   : 0.11861         
##  3rd Qu.:0.40359   3rd Qu.:0.5083      3rd Qu.: 0.17700         
##  Max.   :0.92712   Max.   :1.0000      Max.   : 0.65500         
##  global_rate_positive_words global_rate_negative_words rate_positive_words
##  Min.   :0.00000            Min.   :0.000000           Min.   :0.0000     
##  1st Qu.:0.02834            1st Qu.:0.009662           1st Qu.:0.6000     
##  Median :0.03888            Median :0.015326           Median :0.7097     
##  Mean   :0.03955            Mean   :0.016647           Mean   :0.6815     
##  3rd Qu.:0.05025            3rd Qu.:0.021739           3rd Qu.:0.8000     
##  Max.   :0.15217            Max.   :0.184932           Max.   :1.0000     
##  rate_negative_words avg_positive_polarity min_positive_polarity
##  Min.   :0.0000      Min.   :0.0000        Min.   :0.00000      
##  1st Qu.:0.1852      1st Qu.:0.3056        1st Qu.:0.05000      
##  Median :0.2800      Median :0.3583        Median :0.10000      
##  Mean   :0.2884      Mean   :0.3532        Mean   :0.09536      
##  3rd Qu.:0.3846      3rd Qu.:0.4108        3rd Qu.:0.10000      
##  Max.   :1.0000      Max.   :1.0000        Max.   :1.00000      
##  max_positive_polarity avg_negative_polarity min_negative_polarity
##  Min.   :0.0000        Min.   :-1.0000       Min.   :-1.0000      
##  1st Qu.:0.6000        1st Qu.:-0.3282       1st Qu.:-0.7000      
##  Median :0.8000        Median :-0.2536       Median :-0.5000      
##  Mean   :0.7553        Mean   :-0.2596       Mean   :-0.5222      
##  3rd Qu.:1.0000        3rd Qu.:-0.1873       3rd Qu.:-0.3000      
##  Max.   :1.0000        Max.   : 0.0000       Max.   : 0.0000      
##  max_negative_polarity title_subjectivity title_sentiment_polarity
##  Min.   :-1.0000       Min.   :0.0000     Min.   :-1.00000        
##  1st Qu.:-0.1250       1st Qu.:0.0000     1st Qu.: 0.00000        
##  Median :-0.1000       Median :0.1429     Median : 0.00000        
##  Mean   :-0.1073       Mean   :0.2819     Mean   : 0.07093        
##  3rd Qu.:-0.0500       3rd Qu.:0.5000     3rd Qu.: 0.13750        
##  Max.   : 0.0000       Max.   :1.0000     Max.   : 1.00000        
##  abs_title_subjectivity abs_title_sentiment_polarity   targetVar     
##  Min.   :0.0000         Min.   :0.0000               Min.   :     4  
##  1st Qu.:0.1667         1st Qu.:0.0000               1st Qu.:   946  
##  Median :0.5000         Median :0.0000               Median :  1400  
##  Mean   :0.3419         Mean   :0.1558               Mean   :  3366  
##  3rd Qu.:0.5000         3rd Qu.:0.2500               3rd Qu.:  2800  
##  Max.   :0.5000         Max.   :1.0000               Max.   :690400

2.a.v) Count missing values.

sapply(xy_train, function(x) sum(is.na(x)))
##                n_tokens_title              n_tokens_content 
##                             0                             0 
##               n_unique_tokens              n_non_stop_words 
##                             0                             0 
##      n_non_stop_unique_tokens                     num_hrefs 
##                             0                             0 
##                num_self_hrefs                      num_imgs 
##                             0                             0 
##                    num_videos          average_token_length 
##                             0                             0 
##                  num_keywords     data_channel_is_lifestyle 
##                             0                             0 
## data_channel_is_entertainment           data_channel_is_bus 
##                             0                             0 
##        data_channel_is_socmed          data_channel_is_tech 
##                             0                             0 
##         data_channel_is_world                    kw_min_min 
##                             0                             0 
##                    kw_max_min                    kw_avg_min 
##                             0                             0 
##                    kw_min_max                    kw_max_max 
##                             0                             0 
##                    kw_avg_max                    kw_min_avg 
##                             0                             0 
##                    kw_max_avg                    kw_avg_avg 
##                             0                             0 
##     self_reference_min_shares     self_reference_max_shares 
##                             0                             0 
##    self_reference_avg_sharess             weekday_is_monday 
##                             0                             0 
##            weekday_is_tuesday          weekday_is_wednesday 
##                             0                             0 
##           weekday_is_thursday             weekday_is_friday 
##                             0                             0 
##           weekday_is_saturday             weekday_is_sunday 
##                             0                             0 
##                    is_weekend                        LDA_00 
##                             0                             0 
##                        LDA_01                        LDA_02 
##                             0                             0 
##                        LDA_03                        LDA_04 
##                             0                             0 
##           global_subjectivity     global_sentiment_polarity 
##                             0                             0 
##    global_rate_positive_words    global_rate_negative_words 
##                             0                             0 
##           rate_positive_words           rate_negative_words 
##                             0                             0 
##         avg_positive_polarity         min_positive_polarity 
##                             0                             0 
##         max_positive_polarity         avg_negative_polarity 
##                             0                             0 
##         min_negative_polarity         max_negative_polarity 
##                             0                             0 
##            title_subjectivity      title_sentiment_polarity 
##                             0                             0 
##        abs_title_subjectivity  abs_title_sentiment_polarity 
##                             0                             0 
##                     targetVar 
##                             0

2.b) Data visualizations

# Boxplots for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    boxplot(x_train[,i], main=names(x_train)[i])
}

# Histograms for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    hist(x_train[,i], main=names(x_train)[i])
}

# Density plot for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    plot(density(x_train[,i]), main=names(x_train)[i])
}

# Correlation plot
correlations <- cor(x_train)
corrplot(correlations, method="circle")
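As a complement to the visual correlation plot, caret's findCorrelation() can flag attributes whose pairwise correlation exceeds a threshold; a minimal sketch (the 0.90 cutoff is an illustrative choice, not part of the original script):

```r
# Flag attributes with pairwise correlation above the cutoff (illustrative value).
# findCorrelation() suggests which member of each highly correlated pair to drop.
highCorr <- findCorrelation(correlations, cutoff=0.90, names=TRUE)
cat('Attributes flagged as highly correlated:', length(highCorr), '\n')
print(highCorr)
```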

email_notify(paste("Data Summary and Visualization Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@34340fab}"

3. Prepare Data

Some datasets may require additional preparation activities that best expose the structure of the problem and the relationships between the input attributes and the output variable. Some data-prep tasks might include:

3.a) Data Cleaning

# Not applicable for this iteration of the project.

# Mark missing values
#invalid <- 0
#entireDataset$some_col[entireDataset$some_col==invalid] <- NA

# Impute missing values
#entireDataset$some_col <- with(entireDataset, impute(some_col, mean))

3.b) Feature Selection

# Using the Linear Regression (lm) algorithm, we perform the Recursive Feature Elimination (RFE) technique
startTimeModule <- proc.time()
set.seed(seedNum)
rfeCTRL <- rfeControl(functions=lmFuncs, method="repeatedcv", repeats=2)
rfeResults <- rfe(xy_train[,1:totAttr], xy_train[,totCol], sizes=c(30:50), rfeControl=rfeCTRL)
## Warning in predict.lm(object, x): prediction from a rank-deficient fit may
## be misleading
print(rfeResults)
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold, repeated 2 times) 
## 
## Resampling performance over subset size:
## 
##  Variables  RMSE Rsquared  MAE RMSESD RsquaredSD MAESD Selected
##         30 12926  0.01561 3137   8250    0.01166 266.8         
##         31 13098  0.01564 3140   8697    0.01168 272.7         
##         32 13393  0.01577 3144   9607    0.01168 287.7         
##         33 13358  0.01596 3143   9452    0.01189 284.3         
##         34 13329  0.01613 3142   9375    0.01193 283.4         
##         35 12025  0.01614 3115   6823    0.01196 225.4         
##         36 11939  0.01626 3113   6742    0.01208 222.3         
##         37 11768  0.01639 3107   6651    0.01201 217.3         
##         38 11416  0.01645 3100   5393    0.01216 202.2         
##         39 10417  0.01662 3076   3596    0.01203 182.8         
##         40 10465  0.01762 3074   3599    0.01313 185.4         
##         41 10614  0.01786 3077   3587    0.01309 185.0         
##         42 10622  0.01798 3077   3588    0.01305 184.8         
##         43 10634  0.01845 3072   3592    0.01310 187.1         
##         44 10314  0.02003 3050   3620    0.01200 176.6         
##         45 10357  0.02484 3029   3591    0.01656 177.1         
##         46 10328  0.02377 3029   3605    0.01520 177.2         
##         47 10246  0.02459 3026   3659    0.01436 177.4        *
##         48 10430  0.02348 3034   3679    0.01503 184.9         
##         49 10458  0.02460 3034   3685    0.01583 184.9         
##         50 10408  0.02467 3033   3688    0.01550 183.8         
##         58 10276  0.02648 3024   3683    0.01559 180.7         
## 
## The top 5 variables (out of 47):
##    LDA_04, LDA_02, LDA_01, LDA_00, LDA_03
rfeAttributes <- predictors(rfeResults)
cat('Number of attributes identified from the RFE algorithm:',length(rfeAttributes))
## Number of attributes identified from the RFE algorithm: 47
print(rfeAttributes)
##  [1] "LDA_04"                        "LDA_02"                       
##  [3] "LDA_01"                        "LDA_00"                       
##  [5] "LDA_03"                        "global_rate_positive_words"   
##  [7] "n_unique_tokens"               "global_rate_negative_words"   
##  [9] "global_subjectivity"           "min_positive_polarity"        
## [11] "n_non_stop_unique_tokens"      "rate_positive_words"          
## [13] "global_sentiment_polarity"     "n_non_stop_words"             
## [15] "data_channel_is_entertainment" "rate_negative_words"          
## [17] "avg_negative_polarity"         "data_channel_is_lifestyle"    
## [19] "title_sentiment_polarity"      "weekday_is_saturday"          
## [21] "average_token_length"          "abs_title_subjectivity"       
## [23] "avg_positive_polarity"         "data_channel_is_socmed"       
## [25] "data_channel_is_bus"           "max_negative_polarity"        
## [27] "max_positive_polarity"         "min_negative_polarity"        
## [29] "weekday_is_monday"             "weekday_is_thursday"          
## [31] "data_channel_is_world"         "weekday_is_tuesday"           
## [33] "data_channel_is_tech"          "weekday_is_friday"            
## [35] "abs_title_sentiment_polarity"  "n_tokens_title"               
## [37] "title_subjectivity"            "weekday_is_wednesday"         
## [39] "num_self_hrefs"                "num_keywords"                 
## [41] "num_hrefs"                     "num_videos"                   
## [43] "num_imgs"                      "kw_min_min"                   
## [45] "kw_avg_avg"                    "n_tokens_content"             
## [47] "kw_avg_min"
plot(rfeResults, type=c("g", "o"))

# Removing the unselected attributes from the training and validation dataframes
rfeAttributes <- c(rfeAttributes,"targetVar")
xy_train <- xy_train[, (names(xy_train) %in% rfeAttributes)]
xy_test <- xy_test[, (names(xy_test) %in% rfeAttributes)]
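A quick sanity check, not in the original script, can confirm that the training and validation data frames now carry matching attribute sets after the RFE-based reduction:

```r
# Verify both data frames were reduced to the same retained columns.
stopifnot(identical(names(xy_train), names(xy_test)))
cat('Columns retained after RFE:', ncol(xy_train), '\n')
```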

3.c) Data Transforms

# Not applicable for this iteration of the project.
proc.time()-startTimeScript
##    user  system elapsed 
##  79.008   0.745  83.113
email_notify(paste("Data Cleaning and Transformation Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2aaf7cc2}"

4. Model and Evaluate Algorithms

After the data prep, we next work on finding a workable model by evaluating a subset of machine learning algorithms that are good at exploiting the structure of the dataset.

For this project, we will evaluate four linear, three non-linear, and three ensemble algorithms:

Linear Algorithms: Linear Regression, Ridge, LASSO, and ElasticNet

Non-Linear Algorithms: Decision Trees (CART), k-Nearest Neighbors, and Support Vector Machine

Ensemble Algorithms: Bagged CART, Random Forest, and Stochastic Gradient Boosting

The random number seed is reset before each run to ensure that the evaluation of each algorithm is performed using the same data splits. It ensures the results are directly comparable.
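The `seedNum`, `metricTarget`, and `control` objects referenced in the training calls below are defined near the top of the full script; a minimal sketch of a compatible setup (the specific seed value is illustrative, and the resampling settings are inferred from the "Cross-Validated (10 fold, repeated 1 times)" output):

```r
library(caret)

# Illustrative values; the full script defines these during environment setup.
seedNum <- 888                # hypothetical seed value
metricTarget <- "RMSE"        # optimization metric for this regression task
control <- trainControl(method="repeatedcv", number=10, repeats=1)
```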

4.a) Generate models using linear algorithms

# Linear Regression (Regression)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.lm <- train(targetVar~., data=xy_train, method="lm", metric=metricTarget, trControl=control)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
print(fit.lm)
## Linear Regression 
## 
## 27752 samples
##    47 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results:
## 
##   RMSE      Rsquared    MAE     
##   11086.22  0.02402052  3053.665
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
proc.time()-startTimeModule
##    user  system elapsed 
##   3.672   0.116   3.831
email_notify(paste("Linear Regression Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@357246de}"
# Ridge (Regression)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.ridge <- train(targetVar~., data=xy_train, method="ridge", metric=metricTarget, trControl=control)
print(fit.ridge)
## Ridge Regression 
## 
## 27752 samples
##    47 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared    MAE     
##   0e+00   11086.22  0.02402052  3053.665
##   1e-04   10920.75  0.02410280  3048.083
##   1e-01   10327.80  0.02531409  3021.739
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
proc.time()-startTimeModule
##    user  system elapsed 
##  26.246   0.690  27.235
email_notify(paste("Ridge Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@23223dd8}"
# LASSO (Regression)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.lasso <- train(targetVar~., data=xy_train, method="lasso", metric=metricTarget, trControl=control)
print(fit.lasso)
## The lasso 
## 
## 27752 samples
##    47 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE      Rsquared    MAE     
##   0.1       10331.70  0.02463008  3026.996
##   0.5       10339.22  0.02441088  3024.783
##   0.9       10688.93  0.02404296  3042.317
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.1.
proc.time()-startTimeModule
##    user  system elapsed 
##  10.402   0.666  11.194
email_notify(paste("Lasso Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@19bb089b}"
# ElasticNet (Regression)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.en <- train(targetVar~., data=xy_train, method="enet", metric=metricTarget, trControl=control)
print(fit.en)
## Elasticnet 
## 
## 27752 samples
##    47 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared    MAE     
##   0e+00   0.050     10331.90  0.02491503  3029.262
##   0e+00   0.525     10388.08  0.02413528  3029.526
##   0e+00   1.000     11086.22  0.02402052  3053.665
##   1e-04   0.050     10358.28  0.02396919  3073.611
##   1e-04   0.525     10380.94  0.02424434  3027.400
##   1e-04   1.000     10920.75  0.02410280  3048.083
##   1e-01   0.050     10396.22  0.02379999  3114.598
##   1e-01   0.525     10323.25  0.02647541  3018.860
##   1e-01   1.000     10327.80  0.02531409  3021.739
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.525 and lambda = 0.1.
proc.time()-startTimeModule
##    user  system elapsed 
##  27.053   1.259  28.657
email_notify(paste("ElasticNet Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2ff4f00f}"

4.b) Generate models using nonlinear algorithms

# Decision Tree - CART (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.cart <- train(targetVar~., data=xy_train, method="rpart", metric=metricTarget, trControl=control)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
print(fit.cart)
## CART 
## 
## 27752 samples
##    47 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   cp           RMSE      Rsquared     MAE     
##   0.008022312  10442.48  0.014992953  3065.863
##   0.009380481  10410.95  0.015070346  3081.016
##   0.012215379  10405.46  0.009551093  3112.198
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01221538.
proc.time()-startTimeModule
##    user  system elapsed 
##  17.263   0.138  17.594
email_notify(paste("Decision Tree Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@3cb5cdba}"
# k-Nearest Neighbors (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.knn <- train(targetVar~., data=xy_train, method="knn", metric=metricTarget, trControl=control)
print(fit.knn)
## k-Nearest Neighbors 
## 
## 27752 samples
##    47 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   k  RMSE      Rsquared     MAE     
##   5  11615.10  0.003031659  3415.888
##   7  11252.04  0.003658338  3344.773
##   9  11012.61  0.004149603  3288.142
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
proc.time()-startTimeModule
##    user  system elapsed 
## 138.576   0.122 140.204
email_notify(paste("k-Nearest Neighbors Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@1cd072a9}"
# Support Vector Machine (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.svm <- train(targetVar~., data=xy_train, method="svmRadial", metric=metricTarget, trControl=control)
print(fit.svm)
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 27752 samples
##    47 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   C     RMSE      Rsquared    MAE     
##   0.25  10441.92  0.02206080  2459.608
##   0.50  10434.64  0.02118205  2469.632
##   1.00  10426.44  0.01957504  2487.661
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01409941
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01409941 and C = 1.
proc.time()-startTimeModule
##     user   system  elapsed 
## 8714.400   13.778 8831.428
email_notify(paste("Support Vector Machine Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@5594a1b5}"

4.c) Generate models using ensemble algorithms

In this section, we will explore the use and tuning of ensemble algorithms to see whether we can improve the results.

# Bagged CART (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.bagcart <- train(targetVar~., data=xy_train, method="treebag", metric=metricTarget, trControl=control)
print(fit.bagcart)
## Bagged CART 
## 
## 27752 samples
##    47 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results:
## 
##   RMSE      Rsquared    MAE     
##   10453.83  0.01079349  3073.201
proc.time()-startTimeModule
##    user  system elapsed 
## 114.815   0.672 116.757
email_notify(paste("Bagged CART Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@39ba5a14}"
# Random Forest (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.rf <- train(targetVar~., data=xy_train, method="rf", metric=metricTarget, trControl=control)
print(fit.rf)
## Random Forest 
## 
## 27752 samples
##    47 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared    MAE     
##    2    10332.95  0.02387294  3122.195
##   24    10568.35  0.01543550  3332.093
##   47    11019.27  0.01016815  3391.826
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
proc.time()-startTimeModule
##     user   system  elapsed 
## 43297.04    37.91 43808.78
email_notify(paste("Random Forest Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@71be98f5}"
# Stochastic Gradient Boosting (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.gbm <- train(targetVar~., data=xy_train, method="gbm", metric=metricTarget, trControl=control, verbose=F)
print(fit.gbm)
## Stochastic Gradient Boosting 
## 
## 27752 samples
##    47 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE      Rsquared    MAE     
##   1                   50      10340.06  0.02222229  3042.887
##   1                  100      10341.54  0.02200986  3044.364
##   1                  150      10339.66  0.02258045  3039.084
##   2                   50      10418.27  0.01317292  3069.294
##   2                  100      10471.26  0.01139697  3095.352
##   2                  150      10519.68  0.01012141  3115.487
##   3                   50      10433.49  0.01308168  3082.990
##   3                  100      10497.16  0.01130203  3106.313
##   3                  150      10544.77  0.01005860  3132.840
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 1, shrinkage = 0.1 and n.minobsinnode = 10.
proc.time()-startTimeModule
##    user  system elapsed 
## 168.514   0.416 170.735
email_notify(paste("Stochastic Gradient Boosting Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@60f82f98}"

4.d) Compare baseline algorithms

results <- resamples(list(LR=fit.lm, RIDGE=fit.ridge, LASSO=fit.lasso, EN=fit.en, CART=fit.cart, kNN=fit.knn, SVM=fit.svm, BagCART=fit.bagcart, RF=fit.rf, GBM=fit.gbm))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: LR, RIDGE, LASSO, EN, CART, kNN, SVM, BagCART, RF, GBM 
## Number of resamples: 10 
## 
## MAE 
##             Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## LR      2855.464 2909.083 3012.243 3053.665 3108.379 3569.786    0
## RIDGE   2859.562 2907.244 3011.504 3021.739 3105.940 3254.649    0
## LASSO   2855.447 2908.832 3012.244 3026.996 3105.958 3306.624    0
## EN      2857.180 2902.507 3011.071 3018.860 3102.361 3267.184    0
## CART    2963.559 3010.490 3126.046 3112.198 3168.049 3355.742    0
## kNN     3113.805 3186.286 3263.199 3288.142 3356.803 3529.913    0
## SVM     2280.032 2390.282 2512.550 2487.661 2588.172 2713.174    0
## BagCART 2870.441 2960.687 3096.382 3073.201 3159.055 3347.272    0
## RF      2945.335 3040.990 3106.764 3122.195 3208.636 3375.957    0
## GBM     2859.981 2927.264 3045.145 3039.084 3121.259 3266.608    0
## 
## RMSE 
##             Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## LR      6468.860 7477.418 9349.197 11086.22 13456.45 23302.46    0
## RIDGE   6475.178 7472.260 9347.424 10327.80 13456.78 15735.72    0
## LASSO   6468.851 7477.381 9347.849 10331.70 13456.43 15760.19    0
## EN      6462.406 7468.497 9338.927 10323.25 13462.80 15752.18    0
## CART    6553.440 7633.558 9390.836 10405.46 13515.17 15792.26    0
## kNN     7549.125 8399.398 9916.191 11012.61 13952.92 16204.18    0
## SVM     6496.521 7628.757 9450.810 10426.44 13565.78 15796.58    0
## BagCART 6476.117 7857.534 9396.113 10453.83 13596.12 15816.49    0
## RF      6480.871 7502.368 9344.685 10332.95 13479.44 15661.77    0
## GBM     6444.068 7519.640 9333.632 10339.66 13504.61 15748.42    0
## 
## Rsquared 
##                 Min.     1st Qu.      Median        Mean     3rd Qu.
## LR      3.742948e-05 0.013514075 0.026757863 0.024020521 0.034991505
## RIDGE   7.987339e-03 0.013506063 0.026533517 0.025314090 0.035623522
## LASSO   5.490973e-03 0.013515667 0.026756875 0.024630078 0.035134783
## EN      6.501294e-03 0.012847109 0.026507776 0.026475414 0.035127230
## CART    1.789651e-03 0.006260957 0.008512304 0.009551093 0.012280199
## kNN     5.790300e-04 0.001478678 0.003730796 0.004149603 0.006020437
## SVM     8.804711e-03 0.011775682 0.017012320 0.019575044 0.022351600
## BagCART 1.803747e-03 0.006538819 0.008017699 0.010793490 0.013587626
## RF      9.393580e-03 0.015251281 0.024254548 0.023872943 0.032494516
## GBM     6.540232e-03 0.010230344 0.023394023 0.022580451 0.034540579
##                Max. NA's
## LR      0.043137813    0
## RIDGE   0.044167729    0
## LASSO   0.043143424    0
## EN      0.058046151    0
## CART    0.019445628    4
## kNN     0.009344345    0
## SVM     0.043237416    0
## BagCART 0.026706985    0
## RF      0.040481466    0
## GBM     0.037784149    0
dotplot(results)

cat('The average RMSE from all models is:',
    mean(c(results$values$`LR~RMSE`, results$values$`RIDGE~RMSE`, results$values$`LASSO~RMSE`, results$values$`EN~RMSE`, results$values$`CART~RMSE`, results$values$`kNN~RMSE`, results$values$`SVM~RMSE`, results$values$`BagCART~RMSE`, results$values$`RF~RMSE`, results$values$`GBM~RMSE`)))
## The average RMSE from all models is: 10503.99
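The same average can be computed more compactly by selecting every `~RMSE` column from the resamples object, which avoids listing each model by name:

```r
# Gather every RMSE column from the resamples values and average them.
rmseCols <- grep("~RMSE$", names(results$values), value=TRUE)
cat('The average RMSE from all models is:', mean(unlist(results$values[rmseCols])))
```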
email_notify(paste("Baseline Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@6108b2d7}"

5. Improve Accuracy or Results

After we arrive at a short list of machine learning algorithms with a good level of accuracy, we can leverage ways to improve the accuracy of the models.

Using the best-performing algorithms from the previous section, we will search for a combination of parameters for each algorithm that yields the best results.

5.a) Algorithm Tuning

Finally, we will tune the best-performing algorithms from each group further and see whether we can get more accuracy out of them.

# Tuning algorithm #1 - Ridge
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(lambda=c(0.001,0.01,0.1,1))
fit.final1 <- train(targetVar~., data=xy_train, method="ridge", metric=metricTarget, tuneGrid=grid, trControl=control)
plot(fit.final1)

print(fit.final1)
## Ridge Regression 
## 
## 27752 samples
##    47 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared    MAE     
##   0.001   10361.45  0.02425259  3025.748
##   0.010   10535.44  0.02419578  3035.269
##   0.100   10327.80  0.02531409  3021.739
##   1.000   11228.92  0.02445028  3108.861
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
proc.time()-startTimeModule
##    user  system elapsed 
##  26.887   0.069  27.268
email_notify(paste("Algorithm #1 Tuning Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@6aaa5eb0}"
# Tuning algorithm #2 - LASSO
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(fraction=c(0.1,0.4,0.7,1.0))
fit.final2 <- train(targetVar~., data=xy_train, method="lasso", metric=metricTarget, tuneGrid=grid, trControl=control)
plot(fit.final2)

print(fit.final2)
## The lasso 
## 
## 27752 samples
##    47 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE      Rsquared    MAE     
##   0.1       10331.70  0.02463008  3026.996
##   0.4       10328.45  0.02495078  3021.591
##   0.7       10949.14  0.02403318  3049.885
##   1.0       11086.22  0.02402052  3053.665
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.4.
proc.time()-startTimeModule
##    user  system elapsed 
##   9.129   0.036   9.274
email_notify(paste("Algorithm #2 Tuning Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@5f2050f6}"
# Tuning algorithm #3 - ElasticNet
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(lambda=c(0.001,0.01,0.1,1), fraction=c(0.1,0.4,0.7,1.0))
fit.final3 <- train(targetVar~., data=xy_train, method="enet", metric=metricTarget, tuneGrid=grid, trControl=control)
plot(fit.final3)

print(fit.final3)
## Elasticnet 
## 
## 27752 samples
##    47 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24977, 24978, 24977, 24977, 24977, 24976, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared    MAE     
##   0.001   0.1       10353.86  0.02406829  3067.740
##   0.001   0.4       10320.80  0.02664446  3016.074
##   0.001   0.7       10869.15  0.02450979  3044.051
##   0.001   1.0       10361.45  0.02425259  3025.748
##   0.010   0.1       10362.87  0.02361705  3076.726
##   0.010   0.4       10321.73  0.02673645  3019.238
##   0.010   0.7       10587.71  0.02495672  3033.208
##   0.010   1.0       10535.44  0.02419578  3035.269
##   0.100   0.1       10368.39  0.02361158  3083.007
##   0.100   0.4       10322.69  0.02681851  3023.327
##   0.100   0.7       10393.32  0.02537044  3024.840
##   0.100   1.0       10327.80  0.02531409  3021.739
##   1.000   0.1       10367.25  0.02430050  3082.299
##   1.000   0.4       10332.78  0.02526869  3032.705
##   1.000   0.7       10893.06  0.02461167  3070.876
##   1.000   1.0       11228.92  0.02445028  3108.861
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.4 and lambda = 0.001.
proc.time()-startTimeModule
##    user  system elapsed 
##  27.609   0.199  28.137
email_notify(paste("Algorithm #3 Tuning Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@62043840}"

5.d) Compare Algorithms After Tuning

results <- resamples(list(RIDGE=fit.final1, LASSO=fit.final2, ENET=fit.final3))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: RIDGE, LASSO, ENET 
## Number of resamples: 10 
## 
## MAE 
##           Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## RIDGE 2859.562 2907.244 3011.504 3021.739 3105.940 3254.649    0
## LASSO 2855.453 2909.063 3012.243 3021.591 3107.380 3250.385    0
## ENET  2852.412 2902.851 3010.354 3016.074 3099.932 3251.209    0
## 
## RMSE 
##           Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## RIDGE 6475.178 7472.260 9347.424 10327.80 13456.78 15735.72    0
## LASSO 6468.849 7477.420 9348.695 10328.45 13456.44 15725.85    0
## ENET  6455.963 7471.497 9336.962 10320.80 13462.75 15726.91    0
## 
## Rsquared 
##              Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
## RIDGE 0.007987339 0.01350606 0.02653352 0.02531409 0.03562352 0.04416773
## LASSO 0.007234833 0.01351514 0.02675759 0.02495078 0.03504470 0.04314156
## ENET  0.008412791 0.01287458 0.02712340 0.02664446 0.03569260 0.05518764
##       NA's
## RIDGE    0
## LASSO    0
## ENET     0
dotplot(results)

6. Finalize Model and Present Results

Once we have narrowed the field down to a model that we believe can make accurate predictions on unseen data, we are ready to finalize it. Finalizing a model may involve sub-tasks such as:

6.a) Predictions on validation dataset

predictions <- predict(fit.final3, newdata=xy_test)
print(RMSE(predictions, y_test))
## [1] 13049.92
print(R2(predictions, y_test))
## [1] 0.01219678
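For completeness, caret's MAE() helper can report the mean absolute error on the same hold-out predictions (this metric was not part of the original validation step):

```r
# Mean absolute error on the validation set, to complement RMSE and R-squared.
print(MAE(predictions, y_test))
```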

6.b) Create standalone model on entire training dataset

startTimeModule <- proc.time()
library(elasticnet)
## Loading required package: lars
## Loaded lars 1.2
set.seed(seedNum)
totCol <- ncol(xy_train)
totAttr <- totCol-1
#finalModel <- enet(xy_train[,1:totAttr], xy_train[,totCol], lambda=0.001)
#summary(finalModel)
proc.time()-startTimeModule
##    user  system elapsed 
##   0.025   0.000   0.025

6.c) Save model for later use

#saveRDS(finalModel, "./finalModel_Regression.rds")
proc.time()-startTimeScript
##      user    system   elapsed 
## 52668.197    57.012 53334.342
email_notify(paste("Model Validation and Final Model Creation Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@3f2a3a5}"